MMMLU (Multilingual MMLU)

Testing LLMs across 14 languages and 57 subjects — OpenAI’s professionally human-translated benchmark for multilingual knowledge and reasoning

Published: September 7, 2025

Keywords: MMMLU, Multilingual MMLU, multilingual benchmark, LLM evaluation, MMLU, OpenAI, multilingual reasoning, low-resource languages, professional translation, cross-lingual evaluation, knowledge assessment, Yoruba, Swahili, Arabic, Bengali

Introduction

Most AI benchmarks evaluate LLMs in English only — but billions of people around the world interact with AI in their native language. How do we know if a model that scores 90% on English knowledge tests can perform equally well in Arabic, Bengali, Swahili, or Yoruba?

MMMLU (Multilingual Massive Multitask Language Understanding) answers this question directly. Created by OpenAI, it takes the widely used MMLU benchmark — 57 subjects spanning elementary to professional-level knowledge — and translates the entire test set into 14 languages using professional human translators. The result is a rigorous, high-quality multilingual evaluation that exposes dramatic performance gaps between high-resource and low-resource languages.

“Relying on human translators for this evaluation increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba.” — OpenAI MMMLU Dataset

graph LR
    A["MMLU<br/>(English only)<br/>57 subjects"] --> B["Limited to<br/>English-speaking<br/>evaluation"]
    B --> C["MMMLU<br/>14 languages<br/>Human-translated"]
    C --> D["True multilingual<br/>knowledge<br/>assessment"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is MMMLU?

MMMLU is a multilingual extension of the MMLU (Massive Multitask Language Understanding) benchmark. It contains the complete MMLU test set — covering 57 subjects from elementary mathematics to advanced professional topics like law, medicine, and computer science — professionally translated into 14 languages by human translators.

The benchmark ensures that translations are accurate and culturally appropriate, particularly for low-resource languages where machine translation quality is unreliable. This makes MMMLU the gold standard for evaluating whether LLMs can reason and recall knowledge across linguistic boundaries.

Languages Covered

| Language | Locale | Resource Level |
|---|---|---|
| Arabic | AR_XY | Medium |
| Bengali | BN_BD | Low |
| Chinese (Simplified) | ZH_CN | High |
| French | FR_FR | High |
| German | DE_DE | High |
| Hindi | HI_IN | Medium |
| Indonesian | ID_ID | Medium |
| Italian | IT_IT | High |
| Japanese | JA_JP | High |
| Korean | KO_KR | High |
| Portuguese (Brazil) | PT_BR | High |
| Spanish | ES_LA | High |
| Swahili | SW_KE | Low |
| Yoruba | YO_NG | Low |

Key Characteristics

| Feature | Details |
|---|---|
| Base benchmark | MMLU (57 subjects, ~14,000 test questions) |
| Languages | 14 (professional human translations) |
| Total questions | ~197,000 (14 × ~14,000) |
| Question format | Multiple-choice (4 options) |
| Evaluation | Zero-shot, chain-of-thought |
| Subjects | Elementary math to professional law, medicine, CS |
| Translation quality | Professional human translators (not machine translation) |
| License | MIT |

Who Built It?

MMMLU was created by OpenAI as part of their commitment to improving multilingual AI capabilities. The translations were commissioned using professional human translators — a deliberate choice over machine translation to ensure accuracy, especially for low-resource languages like Yoruba and Swahili.

The original MMLU benchmark that MMMLU builds upon was created by:

  • Dan Hendrycks — UC Berkeley (now Center for AI Safety)
  • Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika — UC Berkeley
  • Dawn Song, Jacob Steinhardt — UC Berkeley

MMLU was published at ICLR 2021 and quickly became one of the most widely used benchmarks in AI evaluation.

Publication and Resources

| Resource | Link |
|---|---|
| MMMLU Dataset | huggingface.co/datasets/openai/MMMLU |
| Evaluation code | github.com/openai/simple-evals |
| Original MMLU paper | arxiv.org/abs/2009.03300 |
| Community Leaderboard | Multilingual MMLU Benchmark Leaderboard |

What Skills Does It Test?

MMMLU tests the same broad spectrum of knowledge and reasoning as MMLU — but crucially, it measures whether models can perform these tasks in non-English languages. This reveals both knowledge depth and cross-lingual transfer capabilities.

graph TD
    MMMLU["MMMLU<br/>Multilingual Knowledge"] --> A["STEM<br/>Math, Physics,<br/>Computer Science"]
    MMMLU --> B["Humanities<br/>History, Philosophy,<br/>Literature"]
    MMMLU --> C["Social Sciences<br/>Economics, Law,<br/>Psychology"]
    MMMLU --> D["Professional<br/>Medicine, Law,<br/>Engineering"]
    MMMLU --> E["Cross-Lingual<br/>Transfer<br/>Same knowledge,<br/>14 languages"]
    MMMLU --> F["Low-Resource<br/>Language<br/>Understanding<br/>Yoruba, Swahili,<br/>Bengali"]

    style MMMLU fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What MMMLU Tests |
|---|---|
| Multilingual knowledge recall | Can the model access the same factual knowledge in Arabic as in English? |
| Cross-lingual reasoning | Can the model solve multi-step problems when the question is in Japanese or Hindi? |
| Low-resource language fluency | How much performance degrades for Yoruba, Swahili, and Bengali vs. high-resource languages |
| Subject breadth | 57 subjects from elementary to professional level, same questions across all languages |
| Translation robustness | Whether subtle linguistic differences affect model accuracy |
| Inclusive AI assessment | Real-world readiness for global deployment across diverse language communities |

The 57 MMLU Subject Categories (Grouped)

| Domain | Example Subjects |
|---|---|
| STEM | Abstract Algebra, Astronomy, College Mathematics, Computer Science, Electrical Engineering, Physics |
| Humanities | Formal Logic, Jurisprudence, Moral Disputes, Philosophy, Prehistory, World Religions |
| Social Sciences | Econometrics, Human Sexuality, Marketing, Public Relations, Sociology, US Foreign Policy |
| Professional | Clinical Knowledge, Medical Genetics, Professional Accounting, Professional Law, Professional Medicine |
| Other | Global Facts, Miscellaneous, Nutrition, Virology |

Current Leaderboard

The table below shows the average accuracy across all 14 languages for each model, as published in the official OpenAI MMMLU benchmark results.

Source: OpenAI Simple Evals — MMMLU Results (consulted March 28, 2026). Evaluation uses zero-shot chain-of-thought prompting.

Average Accuracy Across 14 Languages

| Rank | Model | Avg. Accuracy (%) |
|---|---|---|
| 1 | o3 (high) | 88.8 |
| 2 | o1 | 87.7 |
| 3 | o4-mini (high) | 85.2 |
| 4 | GPT-4.5 Preview | 85.1 |
| 5 | GPT-4.1 | 83.7 |
| 6 | GPT-4o (Nov 2024) | 81.4 |
| 7 | o3-mini (high) | 80.7 |
| 8 | GPT-4.1 Mini | 78.5 |
| 9 | GPT-4o Mini | 70.5 |
| 10 | GPT-4.1 Nano | 66.9 |

Performance by Language (Selected Models)

The table below reveals the language performance gap — even for the best model, accuracy ranges from 91.2% (Italian) to just 78.0% (Yoruba).

| Language | o3 (high) | o1 | GPT-4.1 | GPT-4.1 Nano |
|---|---|---|---|---|
| Italian | 91.2% | 89.7% | 86.9% | 73.4% |
| Spanish | 91.1% | 89.9% | 87.6% | 74.8% |
| Portuguese (Brazil) | 91.0% | 89.5% | 87.0% | 74.1% |
| French | 90.6% | 89.3% | 87.0% | 73.9% |
| German | 90.5% | 89.0% | 85.5% | 72.2% |
| Arabic | 90.4% | 89.0% | 84.4% | 65.9% |
| Hindi | 89.8% | 88.3% | 84.2% | 62.9% |
| Indonesian | 89.8% | 88.6% | 85.9% | 71.4% |
| Chinese (Simplified) | 89.3% | 88.9% | 86.1% | 71.0% |
| Korean | 89.3% | 88.2% | 84.9% | 67.9% |
| Japanese | 89.0% | 88.9% | 85.6% | 69.0% |
| Bengali | 87.8% | 87.3% | 82.7% | 58.3% |
| Swahili | 86.0% | 85.4% | 79.5% | 56.6% |
| Yoruba | 78.0% | 75.4% | 64.7% | 45.5% |

Key takeaways:

  • Massive gap between high-resource and low-resource languages — o3 (high) scores 91.2% on Italian but only 78.0% on Yoruba, a 13+ point gap (quantified in the sketch after this list)
  • Yoruba is the hardest language for every model — GPT-4.1 Nano drops to 45.5%, closer to random chance (25%) than to its high-resource scores
  • Reasoning models (o-series) lead the rankings — o3 (high) at 88.8% average beats the best non-reasoning model, GPT-4.5 Preview, at 85.1%
  • Smaller models suffer disproportionately on low-resource languages — GPT-4.1 Nano loses 28 points going from Italian (73.4%) to Yoruba (45.5%)
  • European languages cluster together — Italian, Spanish, Portuguese, French, and German all score within about one point of each other
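To make these gaps concrete, here is a minimal Python sketch that recomputes each model's best-to-worst spread from values transcribed from the table above (only four of the 14 languages are included, so the spreads are illustrative rather than exhaustive):

# Per-language accuracy (%) transcribed from the table above (subset of languages).
scores = {
    "o3 (high)":    {"Italian": 91.2, "Spanish": 91.1, "Swahili": 86.0, "Yoruba": 78.0},
    "o1":           {"Italian": 89.7, "Spanish": 89.9, "Swahili": 85.4, "Yoruba": 75.4},
    "GPT-4.1":      {"Italian": 86.9, "Spanish": 87.6, "Swahili": 79.5, "Yoruba": 64.7},
    "GPT-4.1 Nano": {"Italian": 73.4, "Spanish": 74.8, "Swahili": 56.6, "Yoruba": 45.5},
}

for model, per_lang in scores.items():
    best = max(per_lang, key=per_lang.get)    # strongest language in this subset
    worst = min(per_lang, key=per_lang.get)   # weakest language in this subset
    gap = per_lang[best] - per_lang[worst]
    print(f"{model}: {best} {per_lang[best]:.1f}% -> {worst} {per_lang[worst]:.1f}% (gap {gap:.1f} pts)")

Running this shows the pattern the takeaways describe: the gap widens steadily as models shrink, from about 13 points for o3 (high) to nearly 30 points for GPT-4.1 Nano.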

Where to Explore the Benchmark

Dataset and Evaluation

| Resource | Description | Link |
|---|---|---|
| MMMLU Dataset | Full 197K-question dataset across 14 languages on Hugging Face | huggingface.co/datasets/openai/MMMLU |
| Official Results | Benchmark results with scores for all models and languages | github.com/openai/simple-evals |
| Community Leaderboard | Interactive Hugging Face Space for exploring multilingual results | Multilingual MMLU Leaderboard |

Load the Dataset

from datasets import load_dataset

# Load all languages
dataset = load_dataset("openai/MMMLU", split="test")

# Load a specific language
dataset_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")
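Continuing the snippet above, each record carries the question, the four answer options, the gold letter, and the subject. The column names used below (Question, A, B, C, D, Answer, Subject) follow the dataset card; treat them as an assumption and confirm against .column_names before relying on them:

# Confirm the schema first — the field names below are assumed from the dataset card
print(dataset_fr.column_names)

row = dataset_fr[0]
print(row["Subject"])                  # MMLU subject identifier
print(row["Question"])                 # question text, in French
for letter in ["A", "B", "C", "D"]:
    print(f"{letter}. {row[letter]}")  # the four answer options
print("Gold answer:", row["Answer"])   # one of "A", "B", "C", "D"

# Narrow to a single subject for targeted evaluation (subject naming assumed)
law_fr = dataset_fr.filter(lambda r: r["Subject"] == "professional_law")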

Understanding the Metric

Accuracy (Zero-Shot Chain-of-Thought)

Models are evaluated using zero-shot chain-of-thought prompting — no few-shot examples, no role-playing prompts. The model receives a multiple-choice question in the target language and must select the correct answer (A, B, C, or D).

| Approach | Description |
|---|---|
| Zero-shot | No examples provided — tests raw capability |
| Chain-of-thought | Model can reason step by step before answering |
| Per-language scoring | Accuracy computed separately for each of the 14 languages |
| Average score | Mean accuracy across all 14 languages |
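As an illustration, here is a minimal sketch of what such a scoring loop might look like, reusing the column names assumed earlier and a generic generate(prompt) callable that wraps whatever model is under test; the exact prompt template used by simple-evals may differ:

import re

def build_prompt(row):
    # Zero-shot chain-of-thought: no worked examples, just an instruction
    # to reason first and end with a parseable final-answer line.
    options = "\n".join(f"{letter}. {row[letter]}" for letter in "ABCD")
    return (
        f"{row['Question']}\n\n{options}\n\n"
        "Think step by step, then finish with a single line of the form "
        "'Answer: X', where X is A, B, C, or D."
    )

def accuracy(rows, generate):
    # Fraction of questions where the extracted letter matches the gold answer.
    correct = 0
    for row in rows:
        reply = generate(build_prompt(row))
        match = re.search(r"Answer:\s*([ABCD])", reply)
        correct += bool(match and match.group(1) == row["Answer"])
    return correct / len(rows)

Scoring each locale this way yields the per-language accuracies; the headline MMMLU number is their mean across all 14 languages.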

Why Professional Human Translation Matters

graph LR
    A["Machine Translation<br/>Errors in low-resource<br/>languages"] --> B["Unreliable<br/>benchmark<br/>scores"]
    C["Professional Human<br/>Translation<br/>(MMMLU approach)"] --> D["Accurate, culturally<br/>appropriate<br/>evaluation"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333

Machine-translated benchmarks often contain errors that disproportionately affect low-resource languages, making it unclear whether poor performance reflects model weakness or translation quality. By using professional human translators, MMMLU isolates the variable being tested: the model’s actual multilingual capability.

Why MMMLU Matters

graph LR
    A["English-only<br/>benchmarks"] --> C["MMMLU<br/>14 languages<br/>human-translated"]
    B["Machine-translated<br/>benchmarks<br/>(unreliable)"] --> C
    C --> D["True multilingual<br/>AI performance<br/>measurement"]
    C --> E["Exposes gaps<br/>for underserved<br/>language communities"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. Exposes the multilingual gap — Even the best model drops 13+ points between its strongest and weakest language, revealing that “multilingual” models are far from language-equitable
  2. High-quality human translations — Professional translators ensure the benchmark tests the model, not the translation quality
  3. Low-resource language visibility — Yoruba, Swahili, and Bengali scores expose the real-world readiness (or lack thereof) of LLMs for billions of speakers
  4. 57-subject breadth — Tests knowledge and reasoning across the full academic spectrum, not just narrow domains
  5. Practical deployment signal — Organizations deploying AI globally need to know exactly how much performance they lose in each language


Conclusion

MMMLU sets the standard for multilingual AI evaluation:

  • 14 languages, 57 subjects, ~197,000 questions — the most comprehensive professionally translated multilingual knowledge benchmark
  • Professional human translators ensure accuracy, especially for low-resource languages where machine translation fails
  • The best model (o3 high) averages 88.8% — but drops to just 78.0% on Yoruba, exposing a 13+ point multilingual gap
  • Smaller models suffer disproportionately — GPT-4.1 Nano scores 73.4% on Italian but only 45.5% on Yoruba, a 28-point drop
  • Low-resource languages need urgent attention — Yoruba and Swahili lag high-resource languages by roughly 5 to 28 percentage points, depending on the model

As AI goes global, MMMLU provides the essential reality check: how well does your model actually work for the world’s diverse language communities? For most languages, the answer is “significantly worse than English” — and for low-resource languages, the gap is alarming.
